Improving Text Clustering with Social Tagging
Abstract
In this paper we study the use of social bookmarking to improve the quality of text clustering. Recently, constrained clustering algorithms have been presented as a successful tool to introduce domain knowledge into the clustering process. This paper uses the tags saved by the users of Delicious to generate non-artificial constraints for constrained clustering algorithms. The study demonstrates that it is possible to achieve a high percentage of good constraints with this simple approach and, more importantly, the evaluation shows that the use of these constraints produces a large improvement (up to 91.25%) in the effectiveness of the clustering algorithms.

Introduction and Motivation

Lately, several web-based tagging systems such as Technorati, Flickr and Delicious have become very popular. In this paper we exploit the information created by the community in Delicious, a social bookmarking service where users can save the URLs of their favourite web pages and associate tags with them.

Clustering methods, for their part, are a very important data mining tool for exploiting the knowledge present in data collections. In recent years a new family of clustering algorithms, constrained clustering (Basu, Davidson, and Wagstaff 2008), has gained great importance because these algorithms enable the introduction of domain knowledge into the clustering process.

The work presented in this paper uses Delicious tags to generate positive soft constraints between documents (documents that share some tags are likely to be in the same cluster) and evaluates the effect of using those constraints in two different constrained clustering algorithms: Constrained Normalized Cut (Ji and Xu 2006) and Soft Constrained k-Means (Ares, Parapar, and Barreiro 2009). The evaluation showed large improvements over their non-constrained counterparts, Normalized Cut (Shi and Malik 2000) and k-Means (MacQueen 1967), when using these “social-constraints”.
To the best of our knowledge, this is the first time that the information in social tags has been used in the form of constraints to improve the outcome of a clustering process. Previous efforts to incorporate that information (Ramage et al. 2009) have been oriented towards using tags in an extended vector space model that includes tags and page text, or towards jointly modelling words and tags with latent Dirichlet allocation.

Copyright © 2011, Association for the Advancement of Artificial Intelligence (www.aaai.org). All rights reserved.

Social Tags and Constrained Clustering

Given the tags associated with the documents by the users of Delicious, the most straightforward way to translate that information into constraints would be to create a positive constraint between two documents di and dj (meaning that they should be in the same cluster) if they share some tag. This simple approach is quite naive, because some common tags can produce many invalid constraints. Hence, the approach we have followed is to generate a constraint between two documents only if they have at least t tags in common.

Another important question is the absoluteness of the constraints. Even if we use this approach to turn tags into constraints, a fair number of them are bound to be inaccurate (i.e., linking documents which should not be in the same cluster) unless the parameter t is set to a high value, due to the polysemy of the terms used as tags or to differences in the criteria of the taggers. Consequently, we have used soft positive constraints, meaning that the documents affected by one of them are likely to be in the same cluster, without forcing the clustering algorithm to actually place them together.

In this paper we have used two constrained clustering algorithms: a spectral one, Constrained Normalized Cut (CNC), and a partitional one, Soft Constrained k-Means (SCKM).
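The tag-sharing rule described in this section (a soft positive constraint between two documents that have at least t tags in common) can be illustrated with a minimal sketch. The function name `tag_constraints`, the dictionary layout and the toy tags are assumptions made for illustration only; they are not taken from the paper, which does not describe its implementation.

```python
from itertools import combinations


def tag_constraints(doc_tags, t):
    """Return the pairs of document ids that share at least t tags.

    doc_tags maps a document id to the set of tags users assigned to it
    (hypothetical layout). Each returned pair is meant to be read as a
    soft positive constraint: the two documents are likely, but not
    required, to end up in the same cluster.
    """
    constraints = []
    for di, dj in combinations(sorted(doc_tags), 2):
        shared = doc_tags[di] & doc_tags[dj]  # tags common to both documents
        if len(shared) >= t:
            constraints.append((di, dj))
    return constraints


# Toy data, invented for this example.
doc_tags = {
    "d1": {"python", "programming", "tutorial"},
    "d2": {"python", "programming", "web"},
    "d3": {"cooking", "recipes", "tutorial"},
}

loose = tag_constraints(doc_tags, t=1)
strict = tag_constraints(doc_tags, t=2)
```

With t = 1 the generic tag "tutorial" links a programming page to a cooking page, which is the kind of invalid constraint the text warns about; raising t to 2 keeps only the pair that shares two tags.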
CNC, introduced by Ji and Xu (2006), alters the eigenproblem at the core of the Normalized Cut (NC) method (Shi and Malik 2000) by adding a new term which encodes positive constraints. SCKM, introduced by Ares, Parapar, and Barreiro (2009), extends the Constrained k-Means algorithm (Wagstaff et al. 2001) to allow the use of soft constraints. The assignment policy is similar to that of k-Means, but the similarity score between a document and a centroid is altered depending on the nature of the constraints which affect the document. In both algorithms the strength of the constraints is controlled by a parameter (β in CNC and w in SCKM), with higher values of the parameter meaning stronger constraints.

Evaluation Methodology

We have used the classic methodology in the evaluation of clustering experiments. Starting from a set of documents

Proceedings of the Fifth International AAAI Conference on Weblogs and Social Media
Similar resources
Spectral Clustering in Social-Tagging Systems
Social tagging is an increasingly popular phenomenon with substantial impact on the way we perceive and understand the Web. For the many Web resources that are not self-descriptive, such as images, tagging is the sole way of associating them with concepts explicitly expressed in text. Consequently, users are encouraged to assign tags to Web resources, and tag recommenders are being developed to...
Part of Speech Tagging for French Social Media Data
In the context of Social Media Analytics, Natural Language Processing tools face new challenges on on-line conversational text, such as microblogs, chat, or text messages, because of the specificity of the language used in these channels. This work addresses the problem of Part-Of-Speech tagging (initially for French but also for English) on noisy language usage from the popular social media ser...
Ontology-Based Document Clustering with a Fuzzy Approach
Data mining, also known as knowledge discovery in databases, is the process of discovering unknown knowledge from a large amount of data. Text mining applies data mining techniques to extract knowledge from unstructured text. Text clustering is one of the important techniques of text mining, which is the unsupervised classification of similar documents into different groups. The most important step...
Improving web page clustering using Probabilistic Latent Semantic Analysis
Traditional clustering algorithms are usually based on the bag-of-words (BOW) approach. A notorious disadvantage of the BOW model is that it ignores the semantic relationship among words. As a result, if two documents use different collections of core words to represent the same topic, they may be assigned to different clusters, even though the core words they use are probably synonyms or seman...
POS Tagging of Hindi-English Code Mixed Text from Social Media: Some Machine Learning Experiments
We discuss Part-of-Speech (POS) tagging of Hindi-English Code-Mixed (CM) text from social media content. We propose extensions to the existing approaches and also present a new feature set which addresses the transliteration problem inherent in social media. We achieve an 84% accuracy with the new feature set. We show that the context and joint modeling of language detection and POS tag layers do...
Automated Tag Clustering: Improving search and exploration in the tag space
In this paper we discuss the use of clustering techniques to enhance the user experience and thus the success of collaborative tagging services. We show that clustering techniques can improve the user experience of current tagging services. We first describe current limitations of tagging services, second, we give an overview of existing approaches. We then describe the algorithms we used for t...
Publication date: 2011